Smoothing and compression with stochastic k-testable tree languages
Abstract
In this paper, we describe techniques for learning probabilistic k-testable tree models, a generalization of the well-known k-gram models, which can be used to compress or classify structured data. These models are easy to infer from samples and allow for incremental updates. Moreover, as shown here, back-off schemes can be defined to address data sparseness, a problem that often arises when trees are used to represent the data. These features make the models suitable for compressing structured data files at a better rate than string-based methods.
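To make the idea concrete, here is a minimal sketch of a probabilistic tree model for the special case k = 2, where each node's tuple of child labels is predicted from the node's own label. It is an illustration under stated assumptions, not the paper's implementation: the tree encoding `(label, [children])`, the class name `TwoTestableTreeModel`, and its method names are all hypothetical. The model is estimated by relative frequencies, updated incrementally one tree at a time, and reports the ideal code length `-log2 p(tree)` that an entropy coder could approach. No back-off scheme is implemented here, so any unseen expansion gets probability zero, which is the sparseness problem that smoothing is meant to address.

```python
import math
from collections import Counter, defaultdict

# Assumption of this sketch: a tree is a pair (label, [child trees]).


class TwoTestableTreeModel:
    """Relative-frequency tree model for k = 2 (hypothetical helper class)."""

    def __init__(self):
        self.root_counts = Counter()
        # parent label -> Counter over tuples of child labels
        self.rule_counts = defaultdict(Counter)

    def add_tree(self, tree):
        """Incremental update: add the counts of one more sample tree."""
        self.root_counts[tree[0]] += 1
        stack = [tree]
        while stack:
            label, children = stack.pop()
            self.rule_counts[label][tuple(c[0] for c in children)] += 1
            stack.extend(children)

    def probability(self, tree):
        """Unsmoothed probability: root term times one term per node."""
        total_roots = sum(self.root_counts.values())
        if total_roots == 0:
            return 0.0
        p = self.root_counts[tree[0]] / total_roots
        stack = [tree]
        while stack and p > 0.0:
            label, children = stack.pop()
            expansions = self.rule_counts.get(label)
            if not expansions:
                return 0.0  # parent label never seen in the sample
            rule = tuple(c[0] for c in children)
            p *= expansions[rule] / sum(expansions.values())
            stack.extend(children)
        return p

    def code_length_bits(self, tree):
        """Ideal code length in bits; infinite for zero-probability trees."""
        p = self.probability(tree)
        return math.inf if p == 0.0 else -math.log2(p)


model = TwoTestableTreeModel()
model.add_tree(("a", [("b", []), ("c", [])]))
model.add_tree(("a", [("b", []), ("b", [])]))
print(model.code_length_bits(("a", [("b", []), ("c", [])])))
print(model.code_length_bits(("a", [("c", []), ("c", [])])))
```

The second tree uses an expansion of `a` that never occurred in the sample, so its code length is infinite until the model is either updated with such a tree or smoothed.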
Similar resources
A probabilistic extension of locally testable tree languages
Probabilistic k-testable models (usually known as k-gram models in the case of strings) can be easily identified from samples and allow for smoothing techniques to deal with unseen events. In this paper we introduce the family of stochastic k-testable tree languages and describe how these models can approximate any stochastic rational tree language. This is applied, as a particular case, to the...
K-TLSS(S) language models for speech recognition
The class of K-Testable Languages in the Strict Sense (K-TLSS) is a subclass of regular languages. Previous works demonstrate that stochastic K-TLSS language models describe the same probability distribution as N-gram models, and that smoothing techniques can be efficiently applied (back-off-like methods). Once we have a set of k-TLSS models (k = 1, ..., K) and a smoothing technique that specificall...
Learning k-Testable tree sets from positive data
A k-Testable tree set in the Strict sense (k-TS) is essentially defined by a finite set of patterns of "size" k that are permitted to appear in the trees of the tree language. Given a positive sample S of trees over a ranked alphabet, an algorithm is proposed which obtains the smallest k-TS tree set containing S. The proposed algorithm is polynomial on the size of S and identifies the class of ...
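As a rough illustration of the "permitted patterns" idea, the sketch below handles only the case k = 2 and is not the algorithm proposed in that work. It assumes trees encoded as `(label, [children])` and takes the patterns to be the root labels, the depth-two forks (a label with the tuple of its children's labels), and the leaf labels observed in the sample; the function names `infer_2ts` and `accepts` are hypothetical. Collecting exactly the patterns seen in the sample yields the smallest set of this restricted kind that contains every sample tree.

```python
# Assumption of this sketch: a tree is a pair (label, [child trees]).


def infer_2ts(sample):
    """Collect the root labels, forks and leaf labels seen in the sample."""
    roots, forks, leaves = set(), set(), set()
    for tree in sample:
        roots.add(tree[0])
        stack = [tree]
        while stack:
            label, children = stack.pop()
            if children:
                forks.add((label, tuple(c[0] for c in children)))
                stack.extend(children)
            else:
                leaves.add(label)
    return roots, forks, leaves


def accepts(patterns, tree):
    """A tree is accepted when every pattern it contains is permitted."""
    roots, forks, leaves = patterns
    if tree[0] not in roots:
        return False
    stack = [tree]
    while stack:
        label, children = stack.pop()
        if children:
            if (label, tuple(c[0] for c in children)) not in forks:
                return False
            stack.extend(children)
        elif label not in leaves:
            return False
    return True


sample = [("a", [("b", []), ("a", [("b", []), ("b", [])])])]
patterns = infer_2ts(sample)
deeper = ("a", [("b", []), ("a", [("b", []), ("a", [("b", []), ("b", [])])])])
print(accepts(patterns, deeper))
print(accepts(patterns, ("b", [])))
```

The tree `deeper` is not in the sample but is built only from permitted patterns, so it is accepted; this is how a finite positive sample generalizes to a larger tree set.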
Stochastic K-TSS Bi-Languages for Machine Translation
One of the approaches to statistical machine translation is based on joint probability distributions over some source and target languages. In this work we propose to model the joint probability distribution by stochastic regular bi-languages. Specifically we introduce the stochastic k-testable in the strict sense bi-languages to represent the joint probability distribution of source and target...
Journal title:
- Pattern Recognition
Volume 38
Publication date 2005